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Abstract 

Maximum likelihood fits to data can be performed using binned data and unbinned 
data. The likelihood fits in either case produce only the fitted quantities but not 
the goodness of fit. With binned data, one can obtain a measure of the goodness 
of fit by using the method, after the maximum likelihood fitting is performed. 
With unbinned data, currently, the fitted parameters are obtained but no measure of 
goodness of fit is available. This remains, to date, an unsolved problem in statistics. 
By considering the transformation properties of likelihood functions with respect 
to change of variable, we conclude that the likelihood ratio of the theoretically 
predicted probability density to that of the data density is invariant under change 
of variable and provides the goodness of fit. We show how to apply this likelihood 
ratio for binned as well as unbinned likelihoods and show that even the test is a 
special case of this general theory. In order to calculate errors in the fitted quantities, 
we need to solve the problem of inverse probabilities. We use Bayes' theorem to do 
this, using the data density obtained in the goodness of fit. This permits one to 
invert the probabilities without the use of a Bayesian prior. The resulting statistics 
is consistent with frequentist ideas. 
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1 Introduction 



In particle physics as well as other branches of science, fitting theoretical mod- 
els to data is a crucial end stage to the performance of experiments. Minimizing 
the between theory and experiment is perhaps the most commonly used 
form of fitting, with data binned in histograms. Such fits yield not only the 
fitted parameters and errors on the fitted parameters but also a measure of 
the goodness of fit. Another common fitting method is the maximum likeli- 
hood method which can be performed on binned and unbinned data to obtain 
the best values of theoretical parameters. In the case of unbinned likelihood 
fitting, there is currently no measure of the goodness of fit. In this paper, we 
propose a solution to the problem, which by its nature works generally for 
both binned and unbinned likelihood fits. A general theory of goodness of fit 
in hkelihood fits results. 

1.1 Notation 

In what follows, we will denote by the vector s, the theoretical parameters 
(s for "signal") and the vector c, the experimentally measured quantities or 
"configurations". For simplicity, we will illustrate the method where both s 
and c are one dimensional, though either or both can be multi-dimensional in 
practice. We thus define the theoretical model by the conditional probability 
density P{c\s), defined as the probability of observing c given a value of s. 
The theoretical probability function obeys the normalization condition 




(1) 
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Then an unbinned maximum likelihood fit to data is obtained by maximizing 
the hkelihood [1], 

c^uno.\s) (2) 

i=l 

where the likelihood is evaluated at the n observed data points Q,i = l,n. 
Such a fit will determine the maximum likelihood value s* of the theoretical 
parameters, but will not tell us how good the fit is. 



1.2 To show that C cannot be used as a goodness of fit variable 



The goodness of fit variable must be invariant under a change of variable 
c ^ c'. The value of the likelihood C at the maximum likelihood point does 
not furnish a goodness of fit, since the likelihood is not invariant under change 
of variable. This can be seen by observing that one can transform the variable 
set c to a variable set c' such that P(c'|s*) is uniformly distributed between 
and 1. In one dimension, this is trivially done by the transformation function 
c'(c) such that 

c 

c'(c) = J P{t\s*)dt (3) 

ci 

The variable c ranges from ci to C2 and the probability function P{c\s*) nor- 
malizes to unity in this range. This implies that c' ranges from to 1. Such a 
transformation is known as a hypercube transformation, in multi-dimensions. 
The transformed probability distribution in the variable c' is unity in this 
interval as can be seen by examining the Jacobian of the transformation 



1^1 

I dc I 



l|l=P(c|.-) (4) 
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P{c\s*) = P{c\s*)\ 




(5) 



Other datasets will yield different values of likelihood in the variable space 
c when the likelihood is computed with the original function P{c\s*). How- 
ever, in hypercube space, the value of the likelihood is unity regardless of the 
dataset c'^,i = l,n, thus the likelihood C cannot furnish a goodness of fit by 
itself, since neither the likelihood, nor ratios of likelihoods computed using 
the same distribution P{c\s*) is invariant under variable transformations. The 
fundamental reason for this non-invariance is that only a single distribution, 
namely, P{c\s*) is being used to compute the goodness of fit. 

To illustrate further, we use a concrete example of fitting a dataset using the 
maximum likelihood method as shown in Figure 1(a). The fitting is done in 
the range Ci < c < C2, where Ci = 1.0 and C2 = 5.0. The fitting function is 



which normalizes to unity in the range Ci < c < C2- The fitted dataset is shown 
as a full histogram. The dashed histogram shows a dataset that is a poor fit 
to the data and will produce a smaller value of C when fitted as a function of 
c. Figure 1(b) shows the same data in the hypercube space where the fitted 
function is fiat as per the transformation given in equation 3. Both the datasets 
will produce a value of unity for C in this space implying an equally good fit 
in either case, which is obviously false. This clearly demonstrates that the 
likelihood by itself cannot provide a goodness of fit variable. 




exp(— c/s) 



(6) 



s(exp(-ci/s) - exp(-C2/s)) 
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(b) 
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Fig. 1. (a) shows the fitting in the dataset space. The curve shows the fitted function. 
Superimposed is the fitted data, (full histogram, normalized to unity). The dashed 
histogram shows the different dataset which obviously does not fit to the fitted 
curve, (b) The same plot in hyperspace. The fitted function is fiat by construction. 
Both the fitted data set (full histogram) and the dashed histogram will have the 
same value of likelihood £. in this space which implies that £. cannot be used as a 
goodness of fit variable. 

2 Likelihood ratios 




2.1 The concept of "data likelihood" derived from the pdf of the data 

It is interesting to note that while using as the goodness of fit technique for 
binned histograms, we use two distribution functions, namely the theoretical 
curve and the data. By binning the data, we are in effect estimating the 
probability density function of the data as the second distribution, in addition 
to the theoretical distribution specified by the theoretical curve. In likelihood 
language we define the probability density function (pdf) of the data as 



"^«*«(c) = lim 1^ (7) 
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where N is the number of times the experiment is repeated that results in the 
observable c. The function p<^"*'*(^c) obeys the normalization condition 

J P'^"'{c)dc = 1 (8) 

When one is using binned hkelihoods, the pdf of the data would be estimated 
by binning the events in a histogram and normalizing the sum of contents of 
all bins to unity. In the unbinned case, we will describe below a technique [2] 
on estimating p'^"*«(c) using Probability Density Estimators (PDE). 



We can now define a likehhood ratio C-jz such that 

ni=?p(Q|s) _ p{c-\s) 



■'TZ 



(9) 



where we have used the notation Cn to denote the dataset Cj, i — l,n. 

Since the n events Ci,i — l,n are independent, the probability of obtaining 

the dataset cvj is given by 

i=n 

p^«*«(c;) = H p'^"'{ci) (10) 

i=l 

The quantity P'^"*"(c;) we name the "data likehhood" of the dataset and 
the quantity P{c^\s) as the "theory likelihood" of the dataset c^. We note 
that the "data hkelihood" P''"*"(c^) may also be thought of as the probability 
density of the" n — object" which obeys the normalization condition 



/ 



p'^»*«fc:,) dcz = 1 (111 



Let us now note that jCji is invariant under a general variable transformation 
(not restricted to hypercube transformation) c — > c', since 
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P'^«*«(c') = |^|P'^"*"(c) (13) 



^'n = (14) 

and the Jacobian of the transformation |^| cancels in the numerator and 
denominator in the ratio. This is an extremely important property of the 
likelihood ratio C-n that qualifies it to be a goodness of fit variable. Later, 
we will show that the binned likelihood ratio asymptotically approaches a 
distribution as the number of events n ^ oo, further motivating this choice. 
Since the denominator P'^^°'[cn) is independent of the theoretical parameters 
s, both the likelihood ratio and the likelihood maximize at the same point s*. 
The likehhood ratios for two different data sets and Cn can be combined 
by multiplication as per 



This rule follows from the definition of C-ji in equation 9. In practice, we will 
use the negative log-likelihood ratio NCCTZ = —logeJC-R as the goodness of 
fit variable and minimize it. The multiplication rule of equation 15 results 
in an addition rule for MjCjCTZ. The problem of finding the distribution of 
NCCTZ for a good fit then reduces to finding the distribution of NCCR, in 
hyper-cube space for a variable that is uniformly distributed between zero 
and one, as in Figure 1(b). This is because MCCTZ is invariant under the 
transformation of variable. So all goodness of fit problems using likelihood 
ratios can be reduced to finding the distribution of MCCR, for a variable that 
is uniformly distributed in hypercube space. 
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2.2 Historical use of Likelihood Ratios 

The Neyman-Pearson lemma [3] states that if one is trying to choose between 
two hypotheses Hq and Hi, then the cut on the likehhood ratio 





\Ho) 


P{Cn 


\Hi) 



will have the optimum power in differentiating between the hypotheses Hq 
and Hi, where e is a constant adjusted to obtain the desired purity in favor 
of hypothesis Hq. Notice that this likehhood ratio is between the hkelihood 
computed for two different hypotheses Hq and Hi. Our likelihood ratio differs 
fundamentally from this in that the denominator we use P{c^) is the "data 
hkelihood" that is computed from the distribution of the data and is not tied 
to any hypothesis as such. 



3 Normalizing the theoretical curve to the data 

The method of maximum likelihood fits the shape of the theoretical distribu- 
tion to the data distribution. The theoretical model obeys the normalization 
condition in equation 1 and the likelihood is evaluated at the number of ob- 
served data events n. There is no explicit mention of the theoretically expected 
number of events, which we denote by rif. Later we will show how to incor- 
porate a goodness of fit in the absolute normalization by making use of the 
binomial distribution and its limiting cases the Poisson and the normal distri- 
butions. We will begin by obtaining goodness of fit formulae for the case where 
we bin the data and fit the theoretical shape to the experimental distribution. 



10 



4 Binned Goodness of Fit 

When one bins data in histograms and fits the theory shape to the data, one 
can fit by using either maximum hkehhood or by minimizing y^. In either 
case, the goodness of fit is usually evaluated using x^. We now illustrate how 
the likehhood ratio defined in section 2 can be used to obtain a goodness of fit 
after the maximum likelihood fitting is done. In order to evaluate the likelihood 
ratio, one needs to evaluate the theory likelihood and the data likelihood for 
each value of q. For the binned histogram, we make the approximation of 
assuming that both these quantities are constant for all values of Cj in a given 
bin and evaluating each at the bin center. Let there be n;, bins and let the k^^ 
bin contain entries. 

Jf..! The multinomial distribution 

The probability of obtaining the histogram is given by the multinomial distri- 
bution 

I k=nt 

P{histogram) = [] nck\sr (17) 

k=ni, 

J2 rik^n (18) 
fe=i 

4-2 Degeneracy of the distribution 

The factor -^k=nf, — - denotes the number of ways n events can be partitioned 
to form the observed histogram, which we term the degeneracy V of the his- 
togram. Each of the V histograms is identical to each other and possesses the 
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same goodness of fit. We can then evaluate the goodness of fit for any one of 
the T> degenerate histograms, the hkehhood for which is given by 

£=n^(cfekr'= (19) 
fe=i 

and the hkehhood ratio can be written as 

k=ni, / p/ I \ \ Tifc 



k-- 



(20) 



The value of palta^^^^^ raised to the power Uk in equation 20 results from the 
fact that there are Uk configurations Cj in the /c*'* bin and we are multiplying 
a constant ratio (at the bin center) over Uk configurations. If Ack is the bin 
width for the A;*^ bin, then the data likelihood can be approximated by 

P<^a*«(c,) ^ ^ (21) 
This obeys the normalization condition 

k=ni, 



J P'"'"'(ck)dck ^ E ^^^^'^ = 1- (22) 



The theoretical likelihood can be integrated over the bin to yield 

pbin average ^^^^^^ ^ __ J p(^c\s)dc (23) 

'^c=Cfc-Acfc/2 



This obeys the normalization condition 

k=nb 

J2 P^^"»^^^»9«(cfc|s)Acfc = 1 (24) 

k=l 



Then the likelihood ratio can be written 

'=="6 /nAr, pbin average („ I ^\\ k=rn, /rn -sUk 

Cn=\{{^^^ [Msl\ (25) 

k=i \ rik ) fci^i KnJ 
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where = nAckP^"'°''"^^"'^^{ck\s) is the theoretically expected number of 
events in the fc*'* bin obeying the normalization condition J2kTk = n, as per 
equation 24. This likelihood ratio may be used to obtain a maximum likeli- 
hood fit as well as to obtain a goodness of fit. Note that the hkelihood ratio 

is well-behaved even for empty bins where = 0, since n^* is unity for such 
cases. 

Note that the negative log-hkelihood ratio MCCR, resulting from equation 25 
yields 

ni. 

NCCn^ Y,nklog,{^) (26) 

fe=l 



which is the same result as derived by Baker and Cousins [4] for the multino- 
mial case where normalization is preserved between theory and experiment. 
We have derived the result using very different arguments (than Baker and 
Cousins) for the denominator of the likelihood ratio, namely it is the value of 

the data pdf at the bin center as a result of the general theory developed here. 

If we are reluctant to work out (for reasons of computing speed) the integral 
in equation 23 for each bin at each step of the fitting process, then we can 
approximate it by the bin center values 



-pbin average ( \\ ^ P{Ck\s) 



This then obeys the normalization equation 24 and the expression in equa- 
tion 26 for MCCTZ can be used generally. 
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4-3 To Show that the Binned Negative Log-Likelihood Ratio Approaches a 
Distribution for Large n 



Let the difference between n^, the observed number of events and the 
theoretical number of events be denoted by Xk — rik — Tk- Then Xk — 0, by 
virtue of the normahzation conditions. Then the binned negative log likelihood 
ratio NCCTZ can be written 



NCCn = -log, Cn^-Y^rik log, U - — (28) 



k=\ 



nk. 



This can be expanded in powers of Xk/rik as 



NCCn = -log. Cn = + + ^i-f + ■ ■ ]m 



k=l 



Uk 1 Uk 3 rik 4 rik' 

k=nt 1 \2 1 \3 1 \4 



As n ^ oo, the individual bin contents become normally distributed about 
their expected value Tk with variance — nk{l — Uk/n) ~ rik for ^fe << "f^- 
This is true for all cases (named the null hypothesis) where the data and theory 
fit each other. Then we can write = Xk/rik and 

uccn^Y.\xl + \f, + \^,--- (31) 

k=l ^ ^fc ^°k 

For large n, ~ ^/n^ and the higher order terms may be neglected yielding 



k=rn, -j^ 

oo (32) 

k=i 2 
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This is an example of the hkehhood ratio theorem [5] . The expected value of 
the MCCTZ can then be written 

EmmJg\E(,i),\>^y-^y->^y-^,.. (33, 

where /J-s, /J'A, ■ ' ' the S^'', 4*^* • • • moments of the normal distribution about 
the mean. Since the normal distribution is symmetric about the mean, all the 
odd moments {/is, /^s • • •) are zero. The even moments of the normal distribu- 
tion (for integer I) are given by the formula 

Ii2i = 1.3.5 ■■■{21- l)a''^ (34) 
This yields 

k=ni, -| q 4 1 c 8 

EiJ^CCn) = E ^Eixl) + B + f ^ • • • (35) 
k=i ^ ^^k ^ 

All the remaining terms tend to zero as l/nk{= l/c^) as — >• oo leading to 



k=ni, -1 

EiAfCCTZ) = E -Eixl) = ^ (36) 

fc=i ^ ^ 
E{Cr) = exp(-nb/2) (37) 



The number of degrees of freedom for NCCR, would be — 1, due to the 
normalization condition '^fe — ^■ 

4-4 Normalizing theory and experiment and the problem of Goodness of fit 
for the Poisson distribution 

As we have pointed out, maximum likelihood fitting only fits the shape of the 
theoretical distribution to the experimental data. This is due to the normal- 
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ization condition of equation 1. However, if we employ a binomial distribution 
and define the first bin as containing the number of observed events n with the- 
oretical expectation of rit events, and the second bin to contain the number of 
unobserved events in N tries, then one can employ the formula in equation 25 
with Hb = 2 to obtain the likelihood ratio. 

= ('JiV f^Y'" = ("-iV f i^f" (38) 



nj \N — nJ \nj \l — n/N 



We now take the Poissonian limit of A?" — > oo with rit and n finite and the 
above likelihood ratio becomes 

£^ = e-("*-") (^^y (39) 

where we have employed the relations {N — n) — >■ iV and (1 — x/N)^ — >■ 
as A?" — > oo. 

Equation 39 provides the goodness of fit likelihood ratio for all Poissonian 
problems where Ut events are expected and n are observed. We can now mul- 
tiply this Poissonian Cr with equation 25 to produce the likelihood ratio for 
a general binned likelihood problem where the normalization for theory and 
experiment vary. 

k \ _(•»,. -j-i- / -t t. 

k=l 



=e-(--)nif) (40) 

^ ^ ^ k=l ^'^k'' \'^k, 



where we have defined T^. — utTk/n and J^T^. — rit. With this redefinition, we 
obtain the MCLTZ for the multinomial with theoretical normalization differing 
from the experimental one as 

ni. 

NCCn ^Y^Tj^-Uk + rik log^i^) (41) 

k=l ^k 
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This is same as the "Poissonian result" of Baker and Cousins [4] again derived 
using very different arguments for the denominator of the hkehhood ratio. 



4-5 The Gaussian limit of the binomial 



The Poissonian result is useful when Ut and n are relatively small numbers 
(<^ 25). When we have larger number of events, then the Gaussian approxi- 
mation is more relevant. We have already shown that (equation 30) that in a 
multinomial, the negative log likelihood ratio can be approximated by 

J^ccn^Y^ol^] (42) 

fe=l ^ \'''kj 

We apply this to the binomial with ni, — 2, ni — n, and n2 — N — n and 
Ai = — A2 — n — Ut- Then 

^^^^ ((Wiww) '''' 

2 \Npql 2(72 ^ ' 

where p — Ut/N ^ n/N is the probability of an event appearing in the first 
bin and q — 1 — p and = Npq is the variance of the bin contents of 
the first bin. We now let — > cxo, n — > 00 and >> n. In this case, 
the variance can be approximated by n and we have the Gaussian case with 
MCCR, = (n — UtY /2n). This MCCR, can be added to the one resulting from 
the maximum likelihood shape fitting to get an overall goodness of fit. 

We must emphasize once again that the method of maximum likelihood always 
fits theoretical shapes to experimental data. We have been able to circumvent 
this restriction by using the device of the binomial distribution where the 
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observed events n are in the first bin and the total number of events in the 
distribution refer to the "number of tries" and the second bin consists of the 
N—n events that failed to appear in the experiment. The binomial distribution 
is special in this regard since once we specify the properties of the first bin, 
the second bin is completely specified and anti-corrclatcd with the first bin. 
The number of tries is unknown, but we set it to infinity in two different limits 
as discussed resulting in the Poisson and the Gaussian hkelihood ratios. 

4 .6 To show that ^ is also the negative logarithm of a likelihood ratio 

The most commonly used method for goodness of fit is the test of Karl 
Pearson, which is used even when the quantities being fitted are not events 
but measurements with error bars. We show here that the measure is also 
twice the negative logarithm of a Gaussian likelihood ratio rather than the 
negative logarithm of a Gaussian likelihood, as is the popular misconception. 
Consider a binned histogram where the contents in the A;*'* bin is noted by 
Cfe and the theoretical expectation of this bin is s^. The standard error of the 
observed variable Ck is known to be ak- Then, one can write 

This leads to 

- log, {P(ck\sk)) = ^ + log,(v^afc) (46) 

Prom the above expression, people are mistakenly led to conclude that is 
equivalent to twice the negative log-likehhood. This ignores the term logg(-\/27rCfe) 
in the above equation, which varies from bin to bin. In order to work out the 
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likelihood ratio, we need to estimate the data density P{ck) at each measure- 
ment. The data points are distributed as a Gaussian with standard deviation 
CTfc. The best estimate of the mean of the Gaussian from the data alone is Cfe. 
This leads to 

p(cfe) - ^ cxp ( _ 1 ^47^ 



yielding the likelihood ratio 



The overall likelihood ratio is given by 

j0.r=U^R (49) 
fe=l 



leading to 

k=nb 

X' = 2 log, {Cn) ^ E xl (50) 

k=l 



i.e. is equal to twice the negative log-likelihood ratio and not the negative 
log-hkelihood!. 



5 Unbinned Goodness of Fit 

Very often the data are not plentiful enough to bin adequately and it is more 
efficient to perform an unbinned likelihood fit. Presently, a goodness of fit 
method does not exist for unbinned likelihood fits. Using the formalism de- 
veloped above, we present a solution. After the unbinned likelihood fit is per- 
formed by maximizing the likelihood in equation 2 one needs to work out 
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the data likelihood P'^°'*°'{cn) in order to evaluate the hkehhood ratio and the 
goodness of fit. We employ the technique of Probability Density Estimators 
{PDE's), also known as Kernel Density Estimators [2] (KDE's) to do this. 
The pdf p<^«*«(c) is approximated by 

P'^«*«(c) ^ PDE{c) = - 2 0{c - q) (51) 

where a Kernel function Q{c — q) is centered around each data point q, is so 
defined that it normalizes to unity. The choice of the Kernel function can vary 
depending on the problem. A popular kernel is the Gaussian defined in the 
multi-dimensional case as 

1 —H'^^d^c^ 
where E is the error matrix of the data defined as 

£;a,/3 ^a^0 > - < C" >< > (53) 



and the <> implies average over the n events, and d is the number of dimen- 
sions. The Hessian matrix H is defined as the inverse of E and the repeated 
indices imply summing over. The parameter /i is a "smoothing parameter", 
which has[6] a suggested optimal value h oc that satisfies the asymp- 

totic condition 

Qoo{c - Cj) = lim Q{c - Ci) = 5{c - q) (54) 



The parameter h will depend on the local number density and will have to be 
adjusted as a function of the local density to obtain good representation of the 
data by the PDE. Our proposal for the goodness of fit in unbinned likelihood 
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fits is thus the hkelihood ratio 



^ P{Cn\s) _ P{cl\s) 



evaluated at the maximum hkehhood point s*. 



5.1 An illustrative example 



We consider a simple one-dimensional case where the data is an exponential 
distribution, say decay times of a radioactive isotope. The theoretical predic- 
tion is given by 

P(c|s) = iexp(--) (56) 
s s 



We have chosen an exponential with s 
Kernel for the PDF would be given by 

where the variance a of the exponential is numerically equal to s. To be- 
gin with, we chose a constant value for the smoothing parameter, which for 
1000 events generated is calculated to be 0.125. Figure 2 shows the generated 
events, the theoretical curve P{c\s) and the PDE curve P(c) normalized to 
the number of events. The PDE fails to reproduce the data near the origin 
due to the boundary effect, whereby the Gaussian probabilities for events close 
to the origin spill over to negative values of c. This lost probability would be 
compensated by events on the exponential distribution with negative c if they 
existed. In our case, this presents a drawback for the PDE method, which we 



1.0 for this example. The Gaussian 



(57) 
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Fig. 2. Figure shows the histogram (with errors) of generated events. Superimposed 
is the theoretical curve P{c\s) and the PDE estimator (sohd) histogram with no 
errors. 

will remedy later in the paper using PDE definitions on the hypercube and 
periodic boundary conditions. For the time being, we will confine our example 
to values of c > 1.0 to avoid the boundary effect. 

In order to test the goodness of fit capabilities of the likelihood ratio £7^, 
we superimpose a Gaussian on the exponential and try and fit the data by a 
simple exponential. Figure 5 shows the "data" with 1000 events generated as 
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an exponential in the fiducial range 1.0 < c < 5.0. Superimposed on it is a 
Gaussian of 500 events. More events in the exponential are generated in the 
interval 0.0<c<1.0to avoid the boundary effect at the fiducial boundary at 
c=1.0. Since the number density varies significantly, we have had to introduce 

a method of iteratively determining the smoothing factor as a function of c. 

5.2 Iterative Determination of the Smoothing Factor 

The expression h 7t,-i/(<^+4) clearly is meant to give a smoothing factor 
that decreases slowly with increased statistics n. It is expected to be true 
on average over the whole distribution. However, the exponential distribution 
under consideration has event densities that vary by orders of magnitude as a 
function of the time variable c. In order to obtain a function h{c) that takes 
into account this variation, we first work out a PDE with constant h and then 
use the number densities obtained thus [8] to obtain h{c) as per the equation 



The equation is motivated by the consideration that a uniform distribution 
of events between ci and C2 has a pdf — l/(c2 — Ci) whereas the real pdf is 
approximated by PDE. The function h{c) thus obtained is used to work out a 
better PDE{c). This process is iterated three times to give the best smoothing 
function. 

We generate n^^lOOO events in the fiducial interval. If now we were to super- 
impose a Gaussian with 500 events centered at c=2.0 and width=0.2 on the 
data, the PDE estimator will follow the data as shown in Figure 5. This shows 
that the PDE estimator we have is adequate to reproduce the data, once the 




(58) 
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smoothing parameter is made to vary with the number density appropriately. 
The smoothing function h{c) for the events in Figure 5 is shown in Figure 3. 
It can be seen that the value of h increases for regions of low statistics and de- 
creases for regions of high statistics. Superimposed is the constant smoothing 
factor obtained by the equation h ~ 0.5n~^^^'^^^^ = O.Sw^'^, with n being the 
total number of events generated, including those outside the fiducial volume. 
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Fig. 3. The variation of /i as a function of c for the example shown in Figure 5. The 
variation of the smoothing parameter is obtained iteratively as explained in the text. 
The flat curve is a smoothing factor resulting from the formula h ~ 0.5n~^^^'^^^\ 
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5.3 An Empirical Measure of the Goodness of Fit 

The negative log-likelihood ratio MCCTZ = —loQeC-ji at the maximum likeh- 
hood point now provides a measure of the goodness of fit. In order to use this 
effectively, one needs an analytic theory of the sampling distribution of this 
ratio. This is difficult to arrive at, since this distribution is sensitive to the 
smoothing function used. If adequate smoothing is absent in the tail of the 
exponential, larger and broader sampling distributions of AfjCCTZ will result. 
One can however determine the distribution of AfCCTZ empirically, by gen- 
erating the events distributed according to the theoretical model many times 
and determining HCCTZ at the maximum likelihood point for each such dis- 
tribution. The solid histogram in figure 4 shows the distribution of MCC1Z for 
500 such fits. This has a mean of 2.8 and an rms of 1.8. The dotted histogram 
shows the corresponding value of MCCTZ for the constant value of smoothing 
factor shown in figure 3. This distribution is clearly broader (rms=2.63) with 
a higher mean(=9.1) and thus has less discrimination power in judging the 
goodness of fit than the solid curve. 

With this modification in the PDE, one gets a good description of the behavior 
of the data by the PDE as shown in Figure 5. We now vary the number of 
events in the Gaussian and obtain the value of the negative log likelihood ratio 
MCCTZ as a function of the strength of the Gaussian. Table 1 summarizes the 
results. The number of standard deviations the unbinned likelihood fit is from 
what is expected is determined empirically by plotting the value of HCCR, 
for a large number of fits where no Gaussian is superimposed (i.e. the null 
hypothesis) and determining the mean and RMS of this distribution and 
using these to estimate the number of ex's the observed NCCIZ is from the 
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2002/06/07 14,49 




Fig. 4. The solid curve shows the distribution of the negative log likelihood ratio 
MCCR. at the maximum likelihood point for 500 distributions, using the iterative 
smoothing function mechanism. The dashed curve shows the corresponding distri- 
bution in the case of a constant smoothing function. 

null case. Table 1 also gives the results of a binned fit on the same "data" . It 
can be seen that the unbinned fit gives a 3(7 discrimination when the number 
of Gaussian events is 85, where as the binned fit gives a jndj of 42/39 for 
the same case. 

Figure 6 shows the variation of -log P(c^|s) and -log P^^^(cn) for an ensemble 
of 500 experiments each with the number of events n — 1000 in the exponential 
and no events in the Gaussian (null hypothesis). One notes that -log P(c^|s) 
and -log P^^^(c^) are correlated with each other and the difference between 
the two (-log HCCTZ) is a much narrower distribution than either and provides 
the goodness of fit discrimination. 
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Fig. 5. Figure shows the histogram (with errors) of 1000 events in the fiducial 
interval 1.0 < c < 5.0 generated as an exponential with decay constant s=1.0 with 
a superimposed Gaussian of 500 events centered at c=2.0 and width=0.2. The PDE 
estimator is the (solid) histogram with no errors. 

5.4 Improving the PDE 

The PDE technique we have used so far suffers from two drawbacks; firstly, 
the smoothing parameter has to be iteratively adjusted significantly over the 
full range of the variable c, since the distribution P{c\s) changes significantly 
over that range; and secondly, there are boundary effects at c=0 as shown in 
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Table 1 

Goodness of fit results from unbinned likelihood and binned likelihood fits for 
various data samples. The negative values for the number of standard deviations in 
some of the examples is due to statistical fluctuation. 



Number of 


Unbinned fit 


Unbinned fit 


Binned fit 


Gaussian events 


Nccn 


Nu 


39 d.o.f. 


500 


189. 


103 


304 


250 


58.6 


31 


125 


ino 

±\J\J 


J. ± .U 


4 Q 


48 


85 


8.2 


3.0 


42 


75 


6.3 


1.9 


38 


50 


2.55 


-0.14 


30 





0.44 


-1.33 


24 



figure 2. Both these flaws are remedied if we define the PDE in hypercube 
space. After we find the maximum likelihood point s*, for which the PDE 
is not needed, we transform the variable c — > c', such that the distribution 
P(c'|s*) is flat and < c' < 1. The hypercube transformation can be made 
even if c is multi-dimensional by initially going to a set of variables that are 
uncorrelated and then making the hypercube transformation. The transfor- 
mation can be such that any interval in c space maps on to the interval (0, 1) 
in hypercube space. 
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Fig. 6. (a) shows the distribution of the negative log-Ukehhood -log(.{P{c^\s)) for 
an ensemble of experiments where data and experiment are expected to fit. (b) 
Shows the negative log PDE likelihood -loge{P{c^)) for the same data (c) Shows 
the correlation between the two and (d) Shows the negative log-likelihood ratio 
MCCIZ that is obtained by subtracting (b) from (a) on an event by event basis. 

5.5 Periodic Boundary Conditions 

We solve the boundary problem by imposing periodicity in the hypercube. 
In the one dimensional case, we imagine three "hypercubes" , each identical 
to the other on the real axis in the intervals (—1,0), (0,1) and (1,2). The 
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hypercube of interest is the one in the interval (0, 1). When the probabihty 
from an event kernel leaks outside the boundary (0, 1), we continue the kernel 
to the next hypercube. Since the hypercubes are identical, this implies the 
kernel re-appearing in the middle hypercube but from the opposite boundary. 
Put mathematically, the kernel is defined such that 

g{d -(J^ = G{d -4-1)- c'>l (59) 
^(c'-4) =^?(c'-c^ + l); c'<0 (60) 



Although a Gaussian Kernel will work on the hypercube, the natural kernel 
to use considering the shape of the distribution in hypercube space (it is flat 
for a good fit), would be the "boxcar function" G^c'). 

m = p \c'\ < \ (61) 

G{d) = 0; |c'| > \ (62) 

This kernel would be subject to the periodic boundary conditions given above, 
which further ensure that every configuration in hypercube space is treated 
exactly as every other configuration irrespective of its co-ordinate. The pa- 
rameter /i is a smoothing parameter which needs to be chosen with some care. 
However, since the theory distribution is fiat in hypercube space, the smooth- 
ing parameter may not need to be iteratively determined over hypercube space 
to the extent that data distribution is similar to the theory distribution. Even 
if iteration is used, the variation in h in hypercube space is likely to be much 
smaller. 

Figure 7 shows the distribution of the NCCTZ for the null hypothesis for 
an ensemble of 500 experiments each with 1000 events as a function of the 

smoothing factor h. It can be seen that the distribution narrows considerably 
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Fig. 7. The distribution of the negative log hkehhood ratio MCCTZ for the null 
hypothesis for an ensemble of 500 experiments each with 1000 events, as a function 
of the smoothing factor /i=0.1, 0.2 and 0.3 

as the smoothing factor increases. We choose an operating value of 0.2 for h 
and study the dependence of the MCCTZ as a function of the number of events 
ranging from 100 to 1000 events, as shown in figure 8. The dependence on the 
number of events is seen to be weak, indicating good behavior. The PDE thus 
arrived computed with h=0.2 can be transformed from the hypercube space 
to c space and will reproduce data smoothly and with no edge effects. We note 
that it is also easier to arrive at an analytic theory of MCCTZ with the choice 
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of this simple kernel. 
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Fig. 8. The distribution of the negative log likelihood ratio MCCTZ for the null hy- 
pothesis for an ensemble of 500 experiments each with the smoothing factor h=0.2, 
as a function of the number of events 



6 The distribution of the goodness of fit variable 

Of all the goodness of fit variables we have studied above, for both binned 
and unbinned likelihood fits, the variable is the most studied and has an 



32 



analytic theory associated with its distribution. This is used to set a p- value 
for the goodness of fit, defined as the probabihty to exceed the observed value 
based on its analytic distribution. In the absence of an analytic theory, 
it is possible to use Monte Carlo methods to obtain the distribution of the 

goodness of fit variable for the hypothesis being tested and to numerically 
obtain the p- value. 

7 Calculation of fitted errors 

After the fitting is done and the goodness of fit is evaluated, one needs to work 
out the errors on the fitted quantities. One needs to calculate the probabil- 
ity density P(s|c^), which carries information not only about the maximum 
likelihood point s*, from a single experiment, but how such a measurement is 
likely to fluctuate if we repeat the experiment. This problem is known as the 
"problem of inverse probabilities" in statistical literature and is solved by the 
use of Bayes' theorem. Since Bayes' theorem is central to the arguments that 
follow, we give a simple derivation of it here. 

7. 1 Derivation of Bayes ' theorem equations 

Consider a joint probability distribution P{s, c) in variables s, c. For the sake 
of simplicity, we will take both s and c to be one-dimensional. The arguments 
being made are general enough to easily change them into multi-dimensional 
variables. Figure 9 shows geometrically the two dimensional space of s and c. 
We plot s as the ordinate and c as the abscissa. At this stage s and c are two 
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general variables. Then, 

J J P{s, c)dsdc = 1 (63) 

We define the single variable probabihties P(c) and P{s) as 

P(c) = J P{s,c)ds (64) 
P(s) = J P(s,c)dc (65) 

P(c) is the probabihty density of c irrespective of the value of s and P{s) 
is the probability density of s irrespective of the value of c. It follows from 
equation 63 that 

J P(s)ds = 1 (66) 

and 

J P(c)dc = 1 (67) 

We define a conditional probability P{c\s) as the probability of observing c 
given s. It is thus, the joint probability P(s, c) along the shce AB (s=constant) 
in figure 9, appropriately normalized to unity, i.e, 

JP{s,c)dc 

where the denominator in the above equation ensures that / P{c\s)dc = 1. 
Therefore, (using equation 65) 

Pic\s) = ^ (69) 
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Fig. 9. Joint probability distribution in the variables s, c. Conditional probabilities 
are computed along the slices AB( s=constant) and CD(c= constant). 

By symmetrical arguments (integrations along the slice CD), we show that 
the conditional probability P{s\c) is given by 

P(.ic) ^ ^ (70) 

leading to the joint probability equation 

P{s, c) = P{c\s)P{s) = P{s\c)P{c) (71) 

which is sometimes written in a more familiar form known as Bayes' theo- 
rem [7] as 

P(.|e) ^ (72) 
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By substituting the expression for P{s, c) in equation 69 in equation 64 we 
get the equation 

P(c) = J P{c\s)P{s)ds (73) 

and by substituting the expression for P(s, c) in equation 70 in equation 65 
we get the equation 

P(s) = j P{s\c)P{c)dc (74) 

These complete the Bayes' theorem equations. Note also that the joint prob- 
ability equation 71 can be written in a form a likelihood ratio Cr 

_ P{s\c) _ P{c\s) 

The quantity Cr equation 75 is invariant under change of variables c ^ c' 
and s — > s', since the Jacobian of the transformation |^| divides out in the 
numerator and the denominator for the right hand side of the equation 75 for 
the ratio of probability densities in ^0^- Similarly the ratio is invariant under 
the transformation variable s in the LHS of the equation. These invariances 
are essential in the use of the ratio Cr as a goodness-of-fit variable. 

We can then extend the derivation given above to derive Bayes' theorem equa- 
tions for the dataset c^. 

P(s, cl) = P(c;|s)P(s) = Pis\cl)Pic-) (76) 

P{Cn) = / P{Cn\s)P{s)ds (77) 
P{S) = J P{s\Cn)P{c„)dCn (78) 
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Let us note that Bayes' theorem as derived above is a theorem in mathematics 
that apphes to any integrable function of two variables P'{s, c) for which the 
normahzed function P(s, c) = r r'^i^'i , , can be constructed. The "proba- 
bihties" P{s) and P(c) are projections of such a function and the "conditional 
probabilities" P{c\s) and P{s\c) arc normalized slices of such a function. How 
we identify the slices and projections to correspond to probability theory is up 
to us. Once the joint probability function P{s, c) can be specified, the inverse 
probability problem can be solved, since the inverse probability is given by the 
slice P{s\c) and the theoretical hkelihood is given by the slice P{c\s). In order 
to specify the joint probabihty, the Bayesians specify the theoretical hkelihood 
P{c^\s) and the projection on the parameter axis P{s) which they term the 
Bayesian prior. We now describe the Bayesian paradigm and the difficulties 
associated with it. 



7.2 The Bayesian Paradigm 



The theoretical likehhood P{c^\s) is specified by the model in question. In 
order to specify the joint probability, Bayesians supply the function P{s), 
which they term the Bayesian prior. The joint probability is then given by 

p{s, 4) = p(c;|s)p(s) = p(s|c;)p(c;) (79) 

and the inverse probability is obtained by 

Cri\s)P{s) ^ P{Cn\s) 
P{Cn) ~ J P{Cn\s)P{s)ds 



The Bayesian prior P(s) is an unknown function that according to the Bayesians 
encapsulates the prior knowledge the experimenter has on the true value of the 
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parameter. There exists many different methods of estimating the unknown 
prior. We briefly describe them here. 

7.2.1 Objective Bayesianism 

In this sub-branch of Bayesianism, the prior is assumed to be known within 
some limits that the user supplies and the distribution is assumed to be 
flat within those limits. This approach is championed by statisticians such 
as Jaynes [9]. If one specifies a flat prior in a variable s, then it is clearly 
not a flat prior in a transformed variable s' — s'{s). The maximum likelihood 
analysis can be carried out in any function s'{s) and the maximum likelihood 
point s* would be the same under such transformations, i.e. 

s*'^s'{s*) (81) 

However, if a flat Bayesian prior is assumed in one variable, it would not be 
flat in any function of that variable. The results would depend on the prior 
assumed. This is a serious objection to this method. 

7.2.2 Subjective Bayesianism 

This approach is championed by statisticians such as de Finetti [10] where the 
experimenter specifies the prior based on "subjective criteria" based on his 
past experience and knowledge of the parameter s. If more than one experi- 
menter is involved, then more than one prior can be used and more than one 

posterior density results. 
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7.2.3 Hierarchical Bayesianism 

This sub- branch of Bayesianism attempts to parametrize the prior [11] in terms 
of more unknown parameters each of which have their own priors, forming a 
hierarchy. 

7.2.4 Empirical Bayesianism 

Empirical Bayesianism [12] attempts to stem the infinite hierarchy of priors 
imphed by hierarchical Bayesianism by attempting to determine some of the 
parameters associated with the priors from the data. 

Note all Bayesians [13] treat the projection P(c^) as an uninteresting constant 
of normalization, theoretically obtained by the equation 



whose right hand side consists of an integral over the Bayesian prior and the 
theoretical likelihood. However, having solved the goodness of fit problem, we 
have demonstrated that the data likehhood P'^"*"(c^) carries with it vital in- 
formation relevant for goodness of fit. It may be thought of as the pdf of the 
n — object c^ and must be empirically determined from experimental data. 
If we use this function from the data as a projection of the joint probability 
and the theoretical likelihood P{c^\s) as a slice, then we can invert the prob- 
ability to obtain P{s\c^) without the use of a Bayesian prior. What results is 
a new paradigm in statistics where we have to re-define some concepts to be 
consistent with this interpretation. 




(82) 
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1.3 The New Paradigm 

We note that if we identify the projection P{cn) of the joint probabiUty P(s, c^) 
as the data-hkehhood, which we denote P'^"*"(c^^), then the projection on the 
s axis depends on n and is thus incompatible with being a Bayesian prior that 
is independent of n. 

To show this, let us note that the inverse probability P{s\cn) 5{s — st) 
as n ^ oo where st is the true value of s. This is a result of the central 
limit theorem of statistics and contains the assumption that the experiment 
is unbiased. Then, using equation 78 and P'^"*"(ci^) for one of the projections, 
one obtains 



But, P'^"'^"'{cn)dcn represents the probability of obtaining the dataset in the 
neighborhood of the dataset c^. If one were to repeat the experiment N times, 
thus obtaining an ensemble of datasets, then P'^^"-{cn)dcn = ^ ior N ^ oo. 
Then, 



where k denotes the ensemble member and the symbols <> represent aver- 
age over the ensemble of the function. However, since Pk{s\c^) — > 5(s — st), 
as n ^ oo, we would expect P{s) S{s — st) in this limit, i.e. P{s) can- 
not find interpretation as an n-independent Bayesian prior. We note however, 
that if one were to plot the probability distribution of the maximum likelihood 
value s* of each member of the ensemble, then such a distribution would have 




(83) 




k=N 



(84) 
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the desired n dependence, becoming narrower for larger n. We thus build our 
theory by identifying the projection on the parameter axis as the probabil- 
ity distribution of s*, which we denote by P„(s*), explicitly indicating the n 
dependence. Then the joint probability distribution P(s*, c^) is given by 

P{s*,Cr:)^P{C\s*)Vn{s*) (85) 

Each member k of the ensemble has a maximum likelihood value s\. The 
probability distribution of this quantity over an infinite ensemble is defined to 
be Vn{,s*). This definition is similar in spirit to the "fiducial probability" of 
R. A. Fisher [14]. We are now able to specify the joint probability P{s*,c^) 
as per equation 85. Then by Bayes' theorem, we can also write 

P(S*, Cn) = Pis*\Cn)P'"'"'(Cn) = P{Cn\s*)Vn(s*) (86) 

where this equation is the definition of the inverse probability P{s*\c^). This 
imphes that from the A;*'* element of the ensemble, consisting of a single dataset 
c^, not only is the maximum likelihood value si available, but also information 
on the distribution of s* from other similar datasets on the ensemble. It is the 
availability of this information that permits the estimation of errors based on 
one dataset. This then leads to the solution of the inverse probability on the 
ensemble by the usual Bayes' theorem equation. 

\^-) pduta^cl) J P{Cn\s*')Vn{s*')ds*' ^ ' 

Equation 87 shows us how to obtain the inverse probability P{s*\c^) once we 
have the ensemble and hence is not much use to us, since it requires an infinite 
number of similar experiments on the ensemble. Our problem is to obtain the 
inverse probability given a single member of the ensemble. Before we proceed 
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to solve this problem, let us note that on the ensemble, Bayes' theorem is 
expressed in the following elegant set of equations. 



V{Cn) = J P{Cn\s*)Vn{s*)ds* =< P{Cn\s*) > 
rk^.*, Pk{s*\cZ) Pk{Cn)\s*) 




(90) 



(89) 



(88) 



<P{s*\Cn)> <P{Cn)\s*)> 



In the above set of equations, we have used the symbol V{c^) to denote the 
data pdf determined on an infinite ensemble to distinguish it from P^"*"(c5^) 
which is the data pdf determined from a single member of the ensemble. The 
former benefits from the statistics present in the infinite ensemble. Equation 90 
gives the the likelihood ratio on the ensemble of each member of the ensemble 
that may be used for goodness of fit once the ensemble is known. Note that 
there is no Bayesian prior used anywhere in the above set of equations. 

7.3.1 The true value of the parameter s 

The true value st of the parameter s is defined to be that value of s* at which 
the maximum of the pdf Vn{s*) occurs. Let us remember that Vn{s*) has an 
infinite number of similar datasets contributing to it and hence this is just 
a statement of the experiments being unbiased. The true value is a number. 
It does not possess a distribution. 

7.3.2 The unknowability ofVn{s*) 

Since the true value st can never be determined to infinite precision, and the 
true value is the abscissa for which the pdf Vn{s*) is the maximum, it follows 
that the function Vn{s*) is unknowable. We cannot associate an abscissa to the 
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function Vn{s*) and hence the function cannot be "anchored" to the s* axis. 
If we could anchor it, we could read off the value of st to infinite precision by 
determining the maximum likelihood value of the function. We thus term this 
function an "unknown concomitant" , to distinguish it from a Bayesian prior. 
It is an abstraction which we approach with ever increasing precision as we 
increase A^, the number of members on the ensemble. 

7.3.3 The standard error on the fitted parameter 

The function is the probability distribution of the maximum likelihood 

values si on the ensemble. The standard error (T„ of the fitted parameter is 
defined as 

al = I {s* - STfVn{s*)ds* =< {s* - srf > (91) 

We also note that Vn{s*) =< P{s*\c^i) > is also the ensemble average of the 
inverse probability functions Pk{s*\c^). Hence the inverse probability function 
Pkis*\c^) from a single dataset k is an unbiased estimator of the (unknowable) 
function P„(s*) and its variance can be used to estimate the standard error 

7.3.4 The evaluation of the inverse probability Pk{s*\c^): The error bootstrap 

We now need to compute the function Pk{s*\c^). We employ Bayes' theorem to 
do this. The error on the fitted parameter s* will be related to the width of the 
inverse probability Pk{s*\c^) that we are trying to compute. It is also related 
to our ignorance of the value of st and our inability to anchor the distribution 
Vn{s*). Our level of ignorance of where to anchor the distribution Vn{s*) is 
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directly related to the error we are trying to compute and is directly related 
to the width of Pk{s*\c^). At this stage, we have worked out the hkelihood 
ratio Cti^{s) as a function of s and have evaluated the maximum likelihood 
value s%. The argument s of the likelihood ratio is a dummy argument of a 
function and we are at liberty to change it to the argument s* as in Cii^{s*) 
for further discussion. We can choose an arbitrary value of s* and evaluate 
the goodness of fit at that value using the hkelihood ratio. When we do this, 
we are in fact hypothesizing that sr, the true value, is at this value of s* . 
The function Cr{s*) then gives us a way of evaluating the goodness of fit of 
the hypothesis as we change s* . Let us now take an arbitrary value of s* and 
hypothesize that that is the true value. Then, consistent with our hypothesis, 
we must insist that the distribution Vn{s*) is moved so that the maximum 
value of the distribution (i.e. st) is at the current value of s*. 

At the true value st, the Bayes' theorem equations for the joint probability 
state 

P{ST, C) = P{Cn\sT)Vn{sT) = P {sT\Cn) P'^"' {C) (92) 

We now hypothesize that the true value is at s* — s-i. Then the above equation 
will read 

P(S1, Cn) - P{Cn\s,)Vn{sT) = P{s,\Cn)P'"''\Cn) (93) 

When we change the hypothesis to a different value s* = S2, then the equation 
will read 

P{S2, Cn) = P{Cn\s2)Vn{sT) = P{s2\Cn)P'"''%Cn) (94) 
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We have moved the distribution Vn{s*) to accommodate our changing hy- 
pothesis from the true value being at si to S2- These hypotheses are mutually 
exclusive in that the true value cannot be at both si and S2- This mandates 
that we move the function Vn{s*) as we change the hypothesis. This set of 
hypotheses thus communicate to our system of equations, our ignorance of the 
position of the true value. The set of hypotheses form an OR of the position 
of the true value, whereas by contrast, the Bayesian prior expresses an AND 
of the position of the true value. Then for the hypothesis that the true value 
is at an arbitrary s*, the above equations become 

P{s*,Cn) = P{Cn\s*)rn{sT) = P {s* \Cn) P""'" {Cn) (95) 

Re- arranging, 

^(^l^n) = ^^fni-ST) (96) 

Imposing the normahzation condition / P{s*\c^)ds* — 1 yields 

where we have explicitly indicated the dependence on the ensemble index k. 
To reiterate, when one varies s* in equation 95, one makes the hypothesis 
that s — St- As one changes s*, a new hypothesis is being tested that is 
mutually exclusive from the previous one, since the true value can only be at 
one location. So as one changes s*, one is compelled to move the distribution 
Vn{s*) SO that St is at the value of s* being tested. // one did not move Vn{s*), 
then this is tantamount to anchoring the function to the s* axis and this is 
not allowable, since the true value is unknown. This implies that Vn{sT) does 
not change as one changes s* and is a constant wrt s*. Figure 10 illustrates 
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these points graphically. Thus Vnisr) in our equations is a number, not a 
function. We have thus "bootstrapped" the error. On the one hand, Pk{s*\c^) 
gives us an estimate of the spread in the measurements of s* from an ensemble 
of datasets c^, based on one such data set. On the other hand, the error in s* 
is expressed in the uncertainty on where to put St- We have connected these 
two uncertainties using Bayes' theorem and hypothesis testing. Also, 



We have thus determined Vnisr), the value of the "unknown concomitant" at 
the true value sx using our data set c„. This is our measurement of 'P„(st) and 
different datasets will give different values of Vnisr), in other words 
will have a sampling distribution with an expectation value and standard 
deviation. 

Note that it is only possible to write down an expression for Vnisr) dimension- 
ally when a likelihood ratio C-fi is available. The equation 97 is the same expres- 
sion that "frequentists" use for calculating their errors after fitting, namely 
the likelihood curve normalized to unity gives the parameter errors. If the 
likelihood curve is Gaussian shaped, then this justifies a change of negative 
log-likelihood of | from the maximum likelihood point to get the la errors. 
Similarly, when performing fitting, it is now rigorously permitted to use 
Ax^ = 1 to estimate errors in fitted parameters under the Gaussian assump- 
tion. No Bayesian prior is needed. 

The "Objective Bayesians" may be tempted to remark that equation 97 is the 
same equation they would derive using a fiat prior and so the two theories are 
equivalent. This is not the case, since their projection of the joint probability 



Vnisr) 




(98) 



/ jCR{s*)ds* J P{cn\s*)ds* 
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on the parameter axis is the Bayesian prior which does not depend on n (for 
all Bayesians), whereas in our case it yields a function Vn{s*) which depends 
on n. So the two theories are radically different. Also, in the new paradigm, 
one does not have to answer the question "Flat in what variable?", as the 
"Objective Bayesians" have to do regarding the prior they use. Our theory is 
invariant no matter what the density of the hypotheses we make in s* space 
is. The Bayesians will obtain different results when they use different densities 
for the prior distribution. 



7.3.5 Iterative behavior of the theory when more than one member of the 
ensemble is available 

We have now solved the problem for the case when one member of the ensemble 
is available. This is what happens in most cases, when only one dataset exists. 
If however we want to study the theory when more the one member of the 
ensemble is present, we can proceed to use equation 88 to work out a better 
approximation for Vn{s*). 

k=N 

Vn{s*) =< Pis*\Cn) >~ T7 E Pkis*\Cn) (99) 

k=l 

We now have an approximation to the function Vn{s*) which is based on 
N datasets instead of one. This approximation can be used to iterate our 
functions Pk{s*\c^) using equation 87. 

^'''"^"/n(c;k*')K(.*>.*' 

where we have used the superscript (2) to indicate that this is the second 
iteration of the function. These sets of functions can be used to obtain a new 
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version of the function Vn{s*) using the relation 



1 k=N 

-Pi'H^*) - Tf E Pt"\^*K) (101) 



Similarly, one needs to iterate on the theoretical hkelihoods Pk{c^)\s*) when 
information from more than one member of the ensemble is available. This is 
done by using equation 89 to obtain an approximation for V{cn). 

k=N 

N 



V{c-) =< P{c-\s*) >~ ^ E Pk{cn\s*) (102) 



k=l 



This equation states that it is possible to get a better estimate of the density 
of data V{cn) from all the N members of the ensemble combined than from a 
single member alone. This is followed by 

pf (c;|.*) = ^^(^15)^(5) ^ (103) 
! Pk{s*K)v{d^)dc^ 

This equation states that in the presence of all N members of the ensemble, 
it is possible to obtain a better value of the likelihood Pk{cn\s*) than the 
theoretical likelihood which assumes that the true value st is at s* . The above 
functions are used to derive an iterated version of the data density V{c^) that 
uses information available from all the members of the ensemble to compute 
the data density. 

1 k=N 

^ - Pt'\cl\s*) (104) 

k=l 

7.3.6 Combining Results of Experiments 

Each experiment should publish a likelihood curve for its fit as well as a number 
for the data likelihood P°'"*"(cn). Combining the results of two experiments 
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with m and n experiments each, involves multiplying the likelihood ratios. 

r^TZ m+n(s) = m(s) X = ^^^^^ X ^^^^ (105) 

Inverted probabilities and goodness of fit can be deduced from the combined 
likelihood ratio. It is worth noting that the numerator of the likelihood ratio 
is a function of the fitted parameters and the denominator is a number. 



7.5". 7 Another Illustrative Example 

We now apply the theory developed here to a practical example to illustrate 
the ideas further. The problem is to determine the weight of an object using 
an apparatus whose standard error is known to be 5 gm. The weight is a fixed 
constant of nature for the duration of the experiment. We obtain a dataset of 
100 measurements, i.e. n = 100. Then P{c\s) is a Gaussian of unknown mean 
s and width cr = 5 gm. We compute P{c^\s) for the 100 events by multiplying 
the individual P{ci\s) together and maximize the likelihood to determine s* for 
the dataset using unbinned likelihoods. We then transform the measurements 
Cj to the hypercube space using equation 3. We use the improved PDE in 
hypercube space with h — 0.2 and determine the goodness of fit and the 
negative log-likelihood ratio MCCTZ. We repeat this for an ensemble of 1000 
experiments. 

Figure 11(a) shows the distribution of s* for this ensemble. The mean value of 
s* over this ensemble is 49.98 gm and the RMS is 0.495 gm which is consistent 
with the expected cr/^^lOO) value of 0.5 gm. Figure 11(b) shows the distri- 
bution of MCCR, for the 1000 members of the ensemble. Figure 11(c) shows 
the likelihood ratio functions Cr{s*) for the first 10 fits in the ensemble. The 
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value of s^., the maximum likelihood value of the k member of the ensemble 
fluctuates as expected, as well as the value of Cr{sI), the negative logarithm 
of which gives the MCCR,. The fluctuation in for the fits in the ensemble 
essentially expresses our lack of knowledge of the position of the true value 
St- The width of the likelihood distribution also contains information on the 
same lack of knowledge. 

We now use equation 97 to obtain inverse probabilities Pk{s*\cn) for each 
member of the ensemble. These functions are shown in Figure 11(d). The 
maximum likelihood value moves around with the expected spread of 0.5 gm. 
The average standard deviation of these curves is 0.5 gm with an rms of 0.65 E- 
3 gm. The average of these functions on an infinite ensemble yields the true 

Pdf Vn{s*). 

7.3.8 Iterative behavior of the theory in the example 

In practice, if one has a dataset with n = 100 and N — 1000 similar instances 
of them, the easiest way to analyze the data is to combine them all into a 
dataset with n' — Nn — 100, 000. However, we are interested in studying 
the function which is estimated by the ensemble average of the func- 

tions Pk{s*\c^). This function tells us the behavior of the distribution of the 
maximum likelihood values s* over similar datasets each with n=100. 

We now iterate to re-determine Pk{s*\c^) and Vn{s*) as per equations 100 
and 101. Figure 12(a) shows the ensemble average estimate of Vn{s*) for 
n=100 and N=1000 before and after iteration. The mean value of the un- 
iterated and iterated functions are the same at 49.977 gm (The Gaussians 
were generated with a true value of 50 gm). The r.m.s values of the function 
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before and after iteration are 0.701 gm and 0.522 gm respectively. The iterated 
function thus has the correct width and mean value. Figure 12(b) shows the 
individual Pk{s*\c^) functions for two members of the ensemble before and af- 
ter iteration. The iterations pull these functions towards the true value, since 
we are inputing additional information on the true value. Figure 13(a) shows 
the values of s* histogrammed for our illustrative example for an ensemble of 
N=1000 and n=100. The superimposed curve is the iterated function Vn{s*) 
calculated for this ensemble normalized to a 1000 element ensemble. It can 
be seen that the function describes the distribution of s* well. Figure 13(b) 
shows the iterated function Vn{s*) for n — 100 and n — 200 respectively. As 
expected, the n = 200 function is narrower and its value at the maximum is 
larger, illustrating that Vnisr) increases with n. 

We can proceed to iteratively work out the likehhoods Pk{c^\s*) and P(c^) as 
per equations 103 and 104. However, it is difficult to plot these functions since 
their argument is multidimensional. Instead we show how the iteration works 
for a special case of the above example where the dataset consists of a single 
measurement, i.e. n = 1. We consider an ensemble of = 1000 measurements 
each with a cr = 1.0 gm. Each single measurement c is fitted to a Gaussian 
likelihood. The maximum likelihood point s* is trivially equal to c and the 
goodness of fit likelihood ratio is always unity. The inverted probability is a 
Gaussian given by 

P{s*\c) = -= (106) 

v27r(T 



We proceed to work out the function Vn{s*) by averaging the above func- 
tions over the ensemble. The resulting Gaussian will be a convolution of 
two Gaussians with width a = 1.0 and will possess a width a = \/2. Sim- 
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ilarly we proceed to work out V{c) by averaging the theoretical hkehhoods 
Pk{c\s*) over the ensemble. This curve will also be too wide for similar rea- 
sons. Figure 14 shows the resulting curves for the first iteration plotted on 
top of the histograms for the data for s* and c respectively. Figure 15 shows 
the curves after the second iteration and both the curves fit well. We now 
plot the resulting joint probability P{s*,c) obtained two different ways. Fig- 
ure 16 shows the joint probability worked out by the B ayes' theorem equation 
P{s*, c) = P{c\s*)Vi{s*) and Figure 17 shows the joint probability worked out 
by the equation P(s*,c) = P(s*|c)P(c) after the iterations have been made. 
It can be seen that both these procedures give the same joint probability dis- 
tribution that possesses a correlation between c and s* that is less extreme 
than the initial correlation of c = s*. The projections of the joint probability 
on the c and the s* axes fit the data well. We have iteratively solved Bayes' 
theorem on the ensemble and inverted the probability correctly without the 
use of a Bayesian prior. 

7.4 The two different methods to obtain 

In our theory, Vn{s*) is the function obtained by histogramming the maximum 
likelihood values for an infinite ensemble of datasets and normalizing the 
resulting histogram to unity, i.e. it is the probability density function of the 
maximum likelihood values on the ensemble, for datasets each containing n 
elements. However, equation 99 shows another way of obtaining the same 
function. What is the connection between the two methods? 

Without loss of generality, we can express the inverse probability function as 
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a function of s* — such that 



Pk{s*\c„) = Qk{s* - si) 



(107) 



Then equation 99 can be re-expressed 



k=l 



^ k=N 



(108) 



But this is just the probabihty density estimator (PDE) for the distribution of 
s*, with the functions Qk serving as the kernels!. They satisfy the normahzation 
condition / Qk{t)dt = 1 as required. This should be compared with equation 51 
for the definition of PDE's. Thus Vn{s*) represents a PDE of the distribution 
of s* and will yield the same distribution as s*. 

In the limit — > oo, we can represent the distribution of the maximum 
likelihood values s* on the ensemble as the continuous pdf Vn{s*). In this 
limit, one can write 



of the kernel as a function of s*' (i.e. ensemble element). The latter half of 



eigenfunction is Vn{s*). 

Let us also note that the iterative method used to solve Bayes' theorem in the 
example given above where c — s*, can be used as a PDE method to adjust 
the kernels by changing their shape iteratively without resort to an adjustable 
parameter h. We could have fed in data generated as an exponential(for ex- 
ample) with the assumption that each measurement has a Gaussian error. 




(109) 



where we have used the notation Q{s*' ,s* — s*') to emphasize the variation 



the above equation is an integral equation with kernel Q{s*' ,s* — s*') whose 
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Then each of the Gaussian kernels would have been altered by the resulting 
exponential function Vn{s*) iterativcly yielding a PDE for the exponential. 

7.5 Co-ordinate transformations s*' — s*' (s*) 

The inverse probability density functions P{s*\c^) are invariant under the 
co-ordinate transformations c' = c'(c). How do they behave under transforma- 
tions s*' — s*'{s*)7 The function Pk{s*\c^) represents our estimate using one 
member of the ensemble of the pdf of s*. So if Pk{s*\c^) represents a pdf, we 
would expect it to behave like a pdf, namely 

ds* 

P^{s*'\Cn)^P{s*\Cn)\-^\ (110) 

This is how pdf's transform (via the Jacobian). This can be shown patently 
not to be so, since Pk{c^\s*') — Pk{c^\s*) and 

Pk{s*'\Cn) = , ^fffll = Xk{Cn)P,{s*\cl) (ill) 
jPk{Cn\s* )ds* 

where the s* independent constant Ajt(c^) is given by 

A^K) = (112) 

j Pk{Cn\s* )ds* 

i.e. the inverse probability densities do not transform in a way that is expected 
of pdf's. This was perhaps a naive expectation. As we have just demonstrated, 
the inverse probability densities serve the purpose of kernels on the ensemble, 
the ensemble average of which gives the pdf Vn{s*). There is no need for the 
kernel from a member of the ensemble to transform to the kernel from the 
same member under these transformations. The properties of the ensemble 
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average deduced from the individual kernels will fluctuate from kernel to ker- 
nel. Similarly, when one analyzes in transformed variables, the same kernel will 
give different results which may be thought of as being part of the fluctuation. 

The distributions of the maximum likelihoods Vn{s*) however will transform 
as pdf's, since Vn{s*) represents the probability density of the maximum like- 
lihood value and s'* = s'{s*). i.e. 

Kis*') = (113) 

The transformed kernels after iteration will yield the transformed V'n{s*'). 
7.5.1 Comparison with the Bayesian approach 

Table 2 outlines the major differences between the Bayesian approach and the 
new paradigm. 

8 Conclusions 

To conclude, we have proposed a general theory for obtaining the goodness of 
flt in likelihood flts for both binned and unbinned likelihood flts. In order to 
obtain a goodness of flt measure, one needs two likelihoods:- one derived from 
theory and the other derived from the data alone. In order to compute the 
errors on fltted quantities, inverse probability densities need to be worked out 
and Bayes' theorem needs to be employed. Using insights gained in solving 
the goodness of flt problem, we demonstrate that it is possible to estimate the 
inverse probability densities without the use of Bayesian prior. This results in 
a new paradigm in statistics. 
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Table 2 

The key points of difference between the Bayesian method and the new method. 



Item 



Bayesian Method 



New Method 



Goodness 



of fit 



Absent 



Now available 
in both binned 
and unbinned fits 



Data 



Used in evaluating 
theory pdf 
at data points 



Used in evaluating 

theory pdf 

at data points 

as well as evaluating 

data pdf at data points 



Prior 



Is a distribution 



that is guessed based 



on "degrees of belief" 



Independent of data, 



monolithic 



No prior needed. 



One calculates a 



constant from data 



VnisT) 



/ P{Cn\s*)ds'- 



oo as n — >■ GO 



Inverse probability 



density 



Depends on Prior. 



Independent of prior. 



same as frequentists use 



p/„u-^ _ P{Cn\s)P(s) 



P{s*\c; 



J P(c^|s') ds' 
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10 Appendix 

In order to demonstrate the capabihties of the unbinned goodness of fit method, 
we illustrate its power with the following example. 

10.1 An extreme problem 

We now attempt to solve a problem with three observed data points, made 

extreme due to the sparsity of data. The problem is stated as follows. 

"Three data points are observed [15] in three dimensional co-ordinate space 
x,y,z with (x,y,z) = (0.1,0.2,0.3), (0.2,0.4,0.1), and (0.05,0.6,0.21). What is 
the goodness of fit to the hypothesis that the observed number of events is 
distributed according to p(a;, y, z) = e~(^+2'+^) ?" 

10.2 Goodness of fit for the above problem 

We note that the likelihood function for the problem is 

1=3 ^ 

C^^-exp-{{xi + yi + Zi)/s) (114) 

where we assume a maximum likelihood fit has been done and the lifetime 

parameter s has been determined to be s* = 1 at the maximum. Since the 



57 



three co-ordinates x,y, and z are uncorrelated (as per the above hkehhood 
function) , we can reformulate the problem as a single dimensional problem as 
follows. 




(115) 



where the n=9 vector Cn = 0.1 0.2 0.3 0.2 0.4 0.1 0.05 0.6 0.21 

We transform the co-ordinates to the hypercube space {s* — 1), with the limits 
of c assumed to be 0.0 and 10.0 ^ . 

Figure 18 shows the transformed co-ordinates in hypercube space. We then 
proceed to work out the negative log- likelihood ratio MCCR, for this configura- 
tion with the "smoothing parameter /i" set to three different values h — 0.2, 0.3 
and 0.4. We study the behavior of the MCCTZ for the null hypothesis (i.e. n=9 
events distributed uniformly in hypercube space) for a 1000 such experiments. 
We repeat this for a dataset of n = 100 as well to study the effect of the small 
data sample on our goodness of fit measure. Figure 19 shows the distribution 
of the HCCTZ for the three different values of h for a data set size n — 9. 
Figure 20 shows the distribution of the MCCTZ for the three different values 
of h for a data set size n = 100. Table 3 summarizes the observed AfCCTZ for 
our dataset as a function of h. The mean and sigma of the null hypothesis 
histograms are also shown as well as the probabihty that the observed MCCTZ 
is exceeded for both the n = 9 null hypothesis and an n = 100 null hypothesis. 
The latter is run to test the sensitivity of the results to the small data sample. 



Since the program expects a finite upper limit, the high value of c=10 is deemed 
to be sufficiently large to be infinite for this problem. 



58 



Table 3 



Summary of results 



Smoothing 


Dataset 


n=9 


n=9 


n=9 Prob. 


n=100 


n=100 


n=100 Prob. 


parameter h 


Nccn 




a 


to exceed 


M 


a 


to exceed 


0.2 


5.36 


0.82 


1.26 


0.5% 


0.77 


1.255 


0.3% 


0.3 


5.84 


0.34 


0.96 


< 0.1% 


0.30 


0.91 


< 0.1% 


0.4 


1.77 


0.12 


0.67 


1.0% 


0.083 


0.697 


1.1% 



10.3 Comments 



The observed data is a bad fit to the model. We have managed not only 
obtain a goodness of fit for the problem (made extreme by the sparsity of 
data), but also to show that the method gives reliable results for a variety 
of smoothing parameters. The method is also robust with respect to the data 
size n. We see that as we increase the smoothing parameter to 0.4, we begin to 
increase the chance of fitting. When h = 1.0, everything will fit. A smoothing 
parameter of h — 0.2 or 0.3 gives reliable results. The probability to exceed 
the observed H LCR. is estimated from the histograms with 1000 experiments. 
We can improve the accuracy of this by running more Monte Carlo statistics. 
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s* arbitrary units 



Fig. 10. Comparison of the usage of Bayesian priors with the new method. In the 
upper figure, illustrating the Bayesian method, an unknown distribution is guessed 
at by the user based on "degrees of belief" and the value of the Bayesian prior 
changes as the variable s changes. In the lower figure, an "unknown concomitant" 
distribution Vn{s*) is used whose shape depends on the statistics of the dataset c^. 
In the case of no bias, this distribution peaks at the true value st- As wc change s*, 
we change our hypothesis as to where the true value of s lies, and the distribution 
shifts with s* as explained in the text. The value of the distribution at the true 
value is thus independent of s* . 
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Fig. 11. (a)The distribution of s*, the maximum likelihood value of s for a 1000 
member ensemble of datasets of n = 100. (b)The goodness of fit variable NCCIZ for 
the fits (c)The likelihood ratio Cr{s*) as a function of s* for the first 10 members 
of the ensemble (d) The function P{s*\c^) for the first 10 members of the ensemble. 
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Fig. 12. (a) The function Vn{s*) computed on the ensemble for n=100 and N=1000. 
The two iterations are shown, with the numbers (1,2) indicating the iteration num- 
ber, (b) The function P{s*\c^) for two elements on the ensemble for the two itera- 
tions. 
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Fig. 13. (a)The distribution of s* (solid histogram) for an ensemble with N=1000 
elements each consisting of a dataset n=100. The curve is the estimate for the 
iterated function Vnis*) for this ensemble normalized to the 1000 observations, (b) 
Vn{s*) on the ensemble for n=100 and n=200. This illustrates that the ensemble 
averaged function, depends on n, the size of the dataset. As n increases, the function 
narrows and the value of the function at its maximum increases. 
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Fig. 14. (a)Histogram of s* values over an ensemble of N=1000. Superimposed is our 
first iteration of the function Vn{s*)- (b) Histogram of c values over an ensemble 
of N=1000. Superimposed is our first iteration for the function -P(c). The RMS 
values refer to the width of the histogram. The first iteration curves are too wide 
as explained in the text. 
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Fig. 15. (a)Histogram of s* values over an ensemble of N=1000. Superimposed is our 
second iteration of the function Vn{s*)- (b) Histogram of c values over an ensemble 
of N=1000. Superimposed is our second iteration for the function -P(c). 
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Fig. 16. Joint probability P(c, s*) computed from P(c\s*)P{s*) at the end of two 
iterations 
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Fig. 17. Joint probability P(c, s*) computed from P{s*\c)P{c) at the end of two 
iterations 
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Fig. 18. Transformed co-ordinates in hypercube space. 
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19. The distribution of MCCR as a function of the smoothing parameter 
0.2,0.3,0.4 for a dataset n = 9 generated to be uniform in the hypercube. 
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20. The distribution of NCCIZ as a function of the smoothing parameter 
0.2, 0.3, 0.4 for a dataset n = 100 generated to be uniform in the hypercube. 
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